Deep Lake: a Lakehouse for Deep Learning

Hambardzumyan, Sasun, Tuli, Abhinav, Ghukasyan, Levon, Rahman, Fariz, Topchyan, Hrant, Isayan, David, McQuade, Mark, Harutyunyan, Mikayel, Hakobyan, Tatevik, Stranic, Ivo, Buniatyan, Davit

arXiv.org Artificial Intelligence

Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, unlock data-driven decision-making, improve operational efficiency, and reduce costs. However, as deep learning usage increases, traditional data lakes are not well designed for applications such as natural language processing (NLP), audio processing, and computer vision, or for applications involving non-tabular datasets. This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop. Deep Lake retains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, and annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) the Tensor Query Language, (b) an in-browser visualization engine, or (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from PyTorch, TensorFlow, and JAX, and integrate with numerous MLOps tools.
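The streaming idea in the abstract can be sketched in plain Python: samples are grouped into chunks, and a training loop consumes one chunk at a time instead of loading the whole dataset. This is a minimal stdlib-only illustration of the concept; the names and chunking scheme are assumptions for the sketch, not Deep Lake's actual API or format.

```python
import itertools

CHUNK_SIZE = 4  # samples per chunk; real systems size chunks in bytes, not counts

def make_chunks(samples, chunk_size=CHUNK_SIZE):
    """Group a flat sequence of samples into chunk-sized lists."""
    it = iter(samples)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def stream(chunks):
    """Yield individual samples from a chunk iterator, one chunk at a time."""
    for chunk in chunks:
        # In a real lakehouse this is where a chunk would be fetched over the
        # network and decoded, ideally before the GPU runs out of work.
        yield from chunk

dataset = list(range(10))                 # stand-in for ten tensor samples
streamed = list(stream(make_chunks(dataset)))
print(streamed)                           # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The point of the chunked layout is that the consumer never needs more than one chunk in memory, which is what makes streaming to a remote training process practical.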


MLOps vs. DevOps: What are the Similarities and Differences?

#artificialintelligence

You've almost certainly heard of DevOps before, especially if you work in the tech world, but you may or may not have heard of MLOps. A newer development in the machine learning world, MLOps is quickly taking hold due to its effective translation of classic DevOps principles. While these two disciplines are related, as the similar names indicate, they also have some key differences you should know about. DevOps is short for software development and IT operations, and the term encompasses both the practices and tools that make up DevOps and the cultural mindset behind them. DevOps represented a major shift in the IT world in the 2010s, moving away from slow, complicated processes toward faster, more iterative development.


Things You Need To Know About Data Science

#artificialintelligence

The field of data science is large and fast-expanding. It's no surprise that so many people want to learn more about it! But what is data science, and what do you need to know if you want to work in this field? One of the most important things to understand about data science is that it is a very hands-on and ever-changing discipline. It's critical to keep learning new things in order to stay current with the latest trends and practices in the field.


A New Way of Managing Deep Learning Datasets - KDnuggets

#artificialintelligence

Hub by Activeloop is an open-source Python package that arranges data in NumPy-like arrays. It integrates smoothly with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. We can update the data, visualize it, and create machine learning pipelines using the Hub API. Hub lets us store images, audio, video, and time-series data in a way that can be accessed at lightning speed. The data can be stored in GCS/S3 buckets, in local storage, or on the Activeloop cloud.


Identify, version control, and document the best performing model during training

#artificialintelligence

Model training can be seen as the generation of successive versions of a model: after each batch, the model weights are adjusted, and as a result a new version of the model is created. Each new version will have a different level of performance (as evaluated against a validation set). If everything goes well, training and validation loss will decrease with the number of training epochs. However, the best-performing version of a model (here abbreviated as the best model) is rarely the one obtained at the end of the training process. Take a typical overfitting case: at first, both training and validation losses decrease as training progresses, but eventually the validation loss starts to rise even as the training loss keeps falling, so the best model is the version saved just before that turning point.
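The best-model bookkeeping described above can be sketched in a few lines: keep whichever version has the lowest validation loss seen so far. The loss values below are fabricated to mimic overfitting, and the `BestModelTracker` name is illustrative rather than tied to any particular framework.

```python
class BestModelTracker:
    """Remember the epoch and weights with the lowest validation loss so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, val_loss, weights):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            # In a real setup this is where you would serialize a checkpoint
            # (and record it in your experiment tracker) instead of copying.
            self.best_weights = dict(weights)

# Simulated run: validation loss improves until epoch 3, then overfitting sets in.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.47, 0.55]
tracker = BestModelTracker()
for epoch, loss in enumerate(val_losses):
    tracker.update(epoch, loss, weights={"epoch": epoch})

print(tracker.best_epoch, tracker.best_loss)  # → 3 0.4
```

Note that the final version (epoch 5) is not the best one, which is exactly why the best model has to be identified and saved during training rather than at the end.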


Top Python libraries of 2021 you should know about

#artificialintelligence

Welcome to a new edition (the 7th!) of our yearly Top Python Libraries list! Starting in December 2015, and uninterruptedly since then, we have been compiling the best Python libraries launched or popularized each year (or late in the previous year). It all started as a "Top 10" series, and although we still have 10 main picks, we nowadays list many more libraries. The work the Python community has been doing is just too good, and we want to give YOU a chance to find these great libraries in case they haven't yet crossed your path. If you are not a fan of most top-10-style posts, bear with us and give this one a chance.


Machine Learning in the Browser

#artificialintelligence

Google Colaboratory, often referred to as Colab, is a product created by Google that lets anyone create and run Python code in the browser. It has many standard machine learning and data science libraries built in, including pandas and scikit-learn, and you can install practically any other Python library for use in each notebook. To access Colab you need to sign up for a Google account, which gives you free access to the notebook environment and computing resources that include GPUs. Let's walk through a quick demo.


YMIR: A Rapid Data-centric Development Platform for Vision Applications

Huang, Phoenix X., Hu, Wenze, Brendel, William, Chandraker, Manmohan, Li, Li-Jia, Wang, Xiaoyu

arXiv.org Artificial Intelligence

This paper introduces an open-source platform to support the rapid development of computer vision applications at scale. The platform puts efficient data development at the center of the machine learning development process, integrates active learning methods and data and model version control, and uses concepts such as projects to enable fast iteration over multiple task-specific datasets in parallel. The platform abstracts the development process into core states and operations, and integrates third-party tools via open APIs as implementations of those operations. This open design reduces the development cost and adoption cost for ML teams with existing tools. At the same time, the platform supports recording project development histories, through which successful projects can be shared to further boost model production efficiency on similar tasks. The platform is open source and is already used internally to meet the increasing demand for different real-world computer vision applications.
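The "core states and operations" abstraction can be sketched as a tiny state machine: a project cycles through data-centric stages, and the operation behind each stage is pluggable so an external tool can implement it. The stage names and the transition table below are assumptions made for illustration, not YMIR's actual design.

```python
# Illustrative stages of one active-learning iteration (not YMIR's real states).
LABEL, TRAIN, MINE = "label", "train", "mine"
TRANSITIONS = {LABEL: TRAIN, TRAIN: MINE, MINE: LABEL}  # one closed loop

def run_iteration(state, operations):
    """Run the operation registered for the current state, then advance."""
    operations[state]()          # e.g. call out to a labeling or training tool
    return TRANSITIONS[state]

log = []
# Each operation is just a pluggable callable; here they only record themselves.
ops = {s: (lambda s=s: log.append(s)) for s in (LABEL, TRAIN, MINE)}

state = LABEL
for _ in range(3):               # one full data-centric iteration
    state = run_iteration(state, ops)

print(log, state)  # → ['label', 'train', 'mine'] label
```

Because the operations are looked up rather than hard-coded, swapping in a different labeling or training backend only changes the `ops` dictionary, which is the spirit of integrating third-party tools via open APIs.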


GitHub - replicate/keepsake: Version control for machine learning

#artificialintelligence

Keepsake is a Python library that uploads files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get the data back out using the command-line interface or a notebook. Once it's wired into your training code, Keepsake tracks everything: code, hyperparameters, training data, weights, metrics, Python dependencies, and so on. Your experiments are all in one place, with filtering and sorting. Because the data is stored on S3, you can even see experiments that were run on other machines.
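The kind of record such a tool keeps can be sketched with the standard library alone: one directory per experiment, holding the weights plus a JSON file of hyperparameters and metrics, keyed by a content-derived id. The layout and names here are assumptions for illustration, not Keepsake's actual storage format or API.

```python
import hashlib
import json
import pathlib
import tempfile

def save_experiment(root, params, metrics, weights_bytes):
    """Write one experiment's weights and metadata under a content-derived id."""
    digest = hashlib.sha256(
        weights_bytes + json.dumps(params, sort_keys=True).encode()
    ).hexdigest()
    exp_dir = pathlib.Path(root) / digest[:12]
    exp_dir.mkdir(parents=True)
    (exp_dir / "weights.bin").write_bytes(weights_bytes)
    (exp_dir / "metadata.json").write_text(
        json.dumps({"params": params, "metrics": metrics})
    )
    return exp_dir.name

root = tempfile.mkdtemp()
exp_id = save_experiment(root, {"lr": 0.01}, {"val_loss": 0.4}, b"\x00\x01")
stored = json.loads((pathlib.Path(root) / exp_id / "metadata.json").read_text())
print(stored["params"]["lr"], stored["metrics"]["val_loss"])  # → 0.01 0.4
```

Swapping the local directory for an S3 or GCS bucket is what makes experiments visible across machines: the record is plain files, so any machine that can read the bucket can list and compare runs.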


A New Era for Mechanical CAD

Communications of the ACM

Computer-Aided Design (CAD) has been around since the 1950s. The first graphical CAD program, called Sketchpad, came out of MIT (designworldonline.com). Since then, CAD has become essential to designing and manufacturing hardware products. Today, there are multiple types of CAD. This article focuses on mechanical CAD, used for mechanical engineering. Digging into the history of computer graphics reveals some interesting connections between the most ambitious and notorious engineers. Ivan Sutherland, who received the Turing Award for Sketchpad in 1988, had Edwin Catmull as a student.